Introducing OpenVemo: A Rapid Content Creation Platform

Project Origin

This project addresses repetitive workflows observed during video blog and short video production within our team. By studying audio/video content production patterns and integrating cutting-edge AI models, we aim to automate key creative processes while applying open-source technologies to broader scenarios.

Target Use Cases

User Input → Story Framework Generation → Video-optimized Script Conversion → Voice/Image Synthesis → Export Draft (for Human Refinement)

Core AI Model Selection

  • DeepSeek-V3: Story framework generation & structured dialogue creation
  • Kokoro TTS: Multi-character voice synthesis
  • FLUX.1-dev: Context-aware image generation
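
The snippets below assume three pre-configured clients for these services. A minimal setup sketch (the base URLs, port, and environment-variable names are assumptions, not the project's actual configuration):

python
import os
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; the client is named `openai`
# to match the calls in the snippets below
openai = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

# kokoro-fastapi also speaks the OpenAI audio API; host/port are illustrative
tts_client = OpenAI(api_key="not-needed", base_url="http://localhost:8880/v1")

# SiliconFlow image-generation endpoint (path is an assumption; check their docs)
SILICONFLOW_ENDPOINT = "https://api.siliconflow.cn/v1/images/generations"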

Key Features

1. Theme-based Story Generation

python
@app.post("/generate-story")
async def create_story(theme: str):
    """Generate 200-300 word structured narratives via DeepSeek API"""
    response = openai.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a professional writer skilled in classical narrative structures..."},
            {"role": "user", "content": f"Create a short story about '{theme}' with character conflicts and resolution"}
        ],
        temperature=0.85
    )
    return format_story(response.choices[0].message.content)
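
`format_story` is not shown in the post; a minimal placeholder consistent with how it is called (hypothetical, not the project's actual helper):

python
def format_story(raw: str) -> dict:
    """Hypothetical helper: split model output into paragraphs for later steps."""
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    return {"text": raw, "sections": paragraphs}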

2. Natural Voice Synthesis

python
def tts_pipeline(text: str, voice: str) -> bytes:
    """11 preset voice options (expandable)"""
    response = tts_client.audio.speech.create(
        model="kokoro",
        voice=voice,
        input=text,
        speed=1.1,
        # pitch is not part of the standard OpenAI SDK signature, so it is
        # passed through to the kokoro-fastapi server via extra_body
        extra_body={"pitch": 0.8}
    )
    return response.content
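
Example usage, assuming `tts_client` is an OpenAI-compatible client pointed at a kokoro-fastapi server (the OpenAI speech API returns mp3 by default):

python
# Synthesize one line and write the raw bytes to disk (filename is illustrative)
audio_bytes = tts_pipeline("Once upon a time, the tide kept a secret.", "af_nicole")
with open("narration.mp3", "wb") as fh:
    fh.write(audio_bytes)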

Current Limitation: Most English voices work well; Chinese and other language support still needs community improvements

3. Context-aware Image Generation

python
@app.post("/generate-scene")
async def render_image(prompt: str):
    payload = {
        "model": "FLUX.1-dev",
        "prompt": f"best quality, 4k, {prompt}",
        "negative_prompt": "blurry, lowres",
        "seed": int(time.time() % 1000)
    }
    response = requests.post(SILICONFLOW_ENDPOINT, json=payload)
    return response.json()["images"][0]["url"]
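
The handler above returns the image URL without checking the HTTP status; a slightly more defensive variant might look like this (a sketch; the route name, timeout, and error shape are assumptions):

python
@app.post("/generate-scene-safe")
async def render_image_safe(prompt: str):
    payload = {
        "model": "FLUX.1-dev",
        "prompt": f"best quality, 4k, {prompt}",
        "negative_prompt": "blurry, lowres",
        "seed": int(time.time() % 1000)
    }
    response = requests.post(SILICONFLOW_ENDPOINT, json=payload, timeout=60)
    response.raise_for_status()  # surface HTTP-level failures early
    data = response.json()
    if not data.get("images"):
        raise HTTPException(502, detail="Image service returned no images")
    return data["images"][0]["url"]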

System Architecture

mermaid
graph TD
    A[User Request] --> B(FastAPI)
    B --> C{Request Type}
    C -->|Text Generation| D[DeepSeek API]
    C -->|Voice Synthesis| E[Kokoro Engine]
    C -->|Image Generation| F[SiliconFlow API]
    D --> G[Data Processing & State Management]
    E --> G
    F --> G
    G --> H[Response Output]

Data Models

python
from pydantic import BaseModel

class Section(BaseModel):
    text: str
    voice: str

class StoryRequest(BaseModel):
    theme: str

class ImageRequest(BaseModel):
    prompt: str
    sectionId: str
    seed: int = 123
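
For reference, these models can be instantiated like so (all field values are made up):

python
story_req = StoryRequest(theme="a lighthouse keeper who fears the sea")
section = Section(text="The lamp flickered twice that night.", voice="af_nicole")
image_req = ImageRequest(prompt="storm over a lighthouse", sectionId="sec-1")  # seed defaults to 123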

Traffic Management

Rate Limiting Strategy (5 requests/day/IP):

python
rate_limit_store = {}  # in-memory per-IP counters: {ip: {"count": int, "date": str}}

@app.middleware("http")
async def rate_limiter(request: Request, call_next):
    client_ip = request.client.host
    today = datetime.now().strftime("%Y-%m-%d")

    if client_ip not in rate_limit_store:
        rate_limit_store[client_ip] = {"count": 1, "date": today}
    else:
        record = rate_limit_store[client_ip]
        if record["date"] != today:  # Daily reset
            record.update({"count": 1, "date": today})
        elif record["count"] >= 5:   # Throttling
            return JSONResponse(status_code=429, content={"error": "Daily limit exceeded"})
        else:
            record["count"] += 1
    return await call_next(request)

Roadmap

  • Session-based user tracking
  • Database integration for usage records
  • Custom API endpoint configuration
  • Enhanced multilingual support

Efficiency Metrics

| Process Stage | Traditional Workflow | OpenVemo | Improvement |
|---|---|---|---|
| Story Creation | 1-2 hours | 35s | 2000% |
| Voiceover | 1-2 hours | 55s | 6000% |
| Scene Imagery | 3 hours | 1 min/image | 18000% |

Live Demo: openvemo.demo
Note: Service stability may vary due to AI provider limitations

OpenVemo: Building a Simple Tool for Rapid Story Creation

Background

This project grew out of the many repetitive workflow problems our team recently ran into while producing simple video blogs and short-form videos. By studying how audio and video content is produced and integrating the capabilities of existing large AI models, we aim to automate more of the current workflow, and we also hope to apply open-source technologies to more scenarios.

A few typical use cases identified among the people around us:

User inputs a theme → Generate story framework → Convert into a video-ready dialogue script → Generate voiceover and illustrations → Export a complete content draft (for further manual refinement)

The large models chosen for this project are:

  • DeepSeek-V3: structured generation of story frameworks and dialogue text
  • Kokoro TTS: multi-character voice synthesis
  • FLUX.1-dev: generating the illustrations the content needs

Main Scenarios the Tool Covers

Theme-based Story Framework Generation

The tool's basic function is to generate, from a given theme, the story framework for a short video, which doubles as a simple outline script.

python
@app.post("/generate-story")
async def create_story(theme: str):
    """通过openai接口调用deepseek api 生成300字以内带一定基础叙述结构的故事"""
    response = openai.chat.completions.create(
        model="deepseek-chat",
        messages=[{
            "role": "system",
            "content": "You are a professional writer skilled in classical narrative structures..."
        },{
            "role": "user",
            "content": f"Create a short story about '{theme}' with character conflicts and a resolution"
        }],
        temperature=0.85
    )
    return format_story(response.choices[0].message.content)

The generated story framework looks like the following; with the current settings it reliably stays between 200 and 300 words.

(Figure: story theme generation)

Voice Synthesis That Sounds as Human as Possible

After comparing various TTS models from China and abroad, we settled on Kokoro TTS: its character voices are lifelike enough for now, and its storage and compute costs are acceptable. We chose the kokoro-fastapi-cpu build, whose container can use the VPS's CPU to generate the voiceovers. Generated audio is kept on the local disk for a while, so users can download and rework the voiceover within a certain window (as long as they do not leave the page).

python
def tts_pipeline(text: str, voice: str) -> bytes:
    """目前11种音色转换,硬编码,后续考虑加入更多音色"""
    response = tts_client.audio.speech.create(
        model="kokoro",
        voice=voice,
        input=text,
        speed=1.1,
        # pitch is not in the standard OpenAI SDK signature; pass it through
        # to the kokoro-fastapi server via extra_body
        extra_body={"pitch": 0.8}
    )
    return response.content

Kokoro's current shortcoming is that its Chinese support is still limited; we look forward to future community updates.

Illustration Generation

Image Generation Model Selection

This step generates the image each part of the story framework needs. Output from many earlier image models did not fit the prompt or the theme well; thanks to the open release of the FLUX image model, we use optimized prompts to render a separate image for each section of the framework generated above. The page also offers the choice to bundle everything together or download images for further processing.

python
@app.post("/generate-scene")
async def render_image(prompt: str):
    payload = {
        "model": "FLUX.1-dev",
        "prompt": f"best quality, 4k, {prompt}",
        "negative_prompt": "blurry, lowres",
        "seed": int(time.time() % 1000)
    }
    response = requests.post(SILICONFLOW_ENDPOINT, json=payload)
    return response.json()["images"][0]["url"]

(Figure: illustration generation)

Data Models and the Voice Synthesis Implementation

python
from typing import List, Optional
from pydantic import BaseModel

class GenerateRequest(BaseModel):
    text: str
    voice: str = "af_nicole"

class Section(BaseModel):
    text: str
    voice: str

class StoryRequest(BaseModel):
    theme: str

class ScriptRequest(BaseModel):
    story: str

class PodcastRequest(BaseModel):
    topic: str

class ImagePromptRequest(BaseModel):
    text: str
    context: Optional[str] = None

class ImageRequest(BaseModel):
    prompt: str
    sectionId: str
    seed: int = 123

class ImageSection(BaseModel):
    id: str
    text: str

class DownloadRequest(BaseModel):
    images: List[dict]
    theme: Optional[str] = None

class TranslationRequest(BaseModel):
    script: str

# TTS routes
@app.get("/voices")
async def get_voices():
    voices = [
        {"id": "af", "name": "Default", "language": "en-us", "gender": "Female"},
        {"id": "af_bella", "name": "Bella", "language": "en-us", "gender": "Female"},
        {"id": "af_nicole", "name": "Nicole", "language": "en-us", "gender": "Female"},
        {"id": "af_sarah", "name": "Sarah", "language": "en-us", "gender": "Female"},
        {"id": "af_sky", "name": "Sky", "language": "en-us", "gender": "Female"},
        {"id": "am_adam", "name": "Adam", "language": "en-us", "gender": "Male"},
        {"id": "am_michael", "name": "Michael", "language": "en-us", "gender": "Male"},
        {"id": "bf_emma", "name": "Emma", "language": "en-gb", "gender": "Female"},
        {"id": "bf_isabella", "name": "Isabella", "language": "en-gb", "gender": "Female"},
        {"id": "bm_george", "name": "George", "language": "en-gb", "gender": "Male"},
        {"id": "bm_lewis", "name": "Lewis", "language": "en-gb", "gender": "Male"}
    ]
    return voices

TTS Voiceover Endpoint and Local Storage Handling

python
@app.post("/generate-and-merge")
async def generate_and_merge(sections: List[Section]):
    async def generate_stream():
        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        output_dir = BASE_DIR / "output" / timestamp
        output_dir.mkdir(parents=True, exist_ok=True)
        temp_dir = BASE_DIR / "temp" / timestamp
        temp_dir.mkdir(parents=True, exist_ok=True)
        
        audio_files = []
        
        try:
            for i, section in enumerate(sections):
                yield json.dumps({
                    "type": "progress",
                    "current": i+1,
                    "total": len(sections),
                    "message": f"Generating audio for section {i+1}/{len(sections)}"
                }) + "\n"
                
                response = tts_client.audio.speech.create(
                    model="kokoro",
                    voice=section.voice,
                    input=section.text
                )
                
                temp_file = temp_dir / f"temp_{i}.wav"
                response.stream_to_file(temp_file)
                audio_files.append(temp_file)
                await asyncio.sleep(0.1)

            yield json.dumps({
                "type": "status",
                "message": "Merging audio files..."
            }) + "\n"

            combined = AudioSegment.empty()
            for f in audio_files:
                combined += AudioSegment.from_wav(f)
            
            final_path = output_dir / "audio.wav"
            combined.export(final_path, format="wav")
            
            yield json.dumps({
                "type": "complete",
                "success": True,
                "filename": f"output/{timestamp}/audio.wav"
            }) + "\n"
            
        except Exception as e:
            yield json.dumps({
                "type": "error",
                "error": str(e)
            }) + "\n"
        finally:
            # Best-effort cleanup of temp files; never raise from cleanup
            for f in audio_files:
                try: f.unlink()
                except OSError: pass
            try: temp_dir.rmdir()
            except OSError: pass

    # Stream the progress events back as newline-delimited JSON
    # (StreamingResponse is imported from fastapi.responses)
    return StreamingResponse(generate_stream(), media_type="application/x-ndjson")
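
A hypothetical client for this endpoint, streaming the newline-delimited progress events (the URL and section values are illustrative):

python
import json
import requests

sections = [
    {"text": "The tide kept a secret.", "voice": "af_nicole"},
    {"text": "And the keeper knew it.", "voice": "am_adam"},
]
# Stream the JSON progress events as they arrive
with requests.post("http://localhost:8000/generate-and-merge",
                   json=sections, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(json.loads(line))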

Translation Endpoints

python
@app.post("/translate-podcast")
async def translate_podcast(request: TranslationRequest):
    try:
        response = openai_client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "你是一名专业的地道的中文翻译家..."},
                {"role": "user", "content": f"Translate this script:\n{request.script}"}
            ]
        )
        return {
            "success": True,
            "translation": response.choices[0].message.content
        }
    except Exception as e:
        raise HTTPException(500, detail={"success": False, "error": str(e)})

@app.post("/translate-story-script")
async def translate_story_script(request: TranslationRequest):
    try:
        response = openai_client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "你是一名专业的地道的中文故事翻译家..."},
                {"role": "user", "content": f"Translate this script:\n{request.script}"}
            ]
        )
        return {
            "success": True,
            "translation": response.choices[0].message.content
        }
    except Exception as e:
        raise HTTPException(500, detail={"success": False, "error": str(e)})
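
The two endpoints above differ only in their system prompt; if more translation routes are added, the call could be factored into a shared helper (a sketch; the helper name is hypothetical):

python
def _translate(script: str, system_prompt: str) -> str:
    """Hypothetical shared helper for the translation endpoints."""
    response = openai_client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Translate this script:\n{script}"}
        ]
    )
    return response.choices[0].message.content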

Traffic Control

The DeepSeek and SiliconFlow services the tool depends on are still resource-constrained, so the project limits the number of API requests, mainly in the cloud-hosted version (5 per day per IP), with the counter cleared each day and restarted the next. The design uses a middleware plus a cached counter; since the project runs on a temporarily hosted cloud server, and with the user experience under growing request volume in mind, the counter data is currently cached in memory.

python
# Core logic of the rate-limiting middleware
@app.middleware("http")
async def rate_limiter(request: Request, call_next):
    client_ip = request.client.host
    today = datetime.now().strftime("%Y-%m-%d")

    # Initialize the counter for a first-time IP
    if client_ip not in rate_limit_store:
        rate_limit_store[client_ip] = {"count": 1, "date": today}
    else:
        record = rate_limit_store[client_ip]
        if record["date"] != today:  # Reset across days
            record.update({"count": 1, "date": today})
        elif record["count"] >= 5:   # Throttle
            return JSONResponse(
                status_code=429,
                content={"error": "Daily limit of 5 requests reached"}
            )
        else:
            record["count"] += 1
    
    return await call_next(request)
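
Stale IP records are only reset when that IP sends another request, so the in-memory store grows over time. The roadmap below mentions a dedicated daily cleanup task; a minimal sketch using FastAPI's startup hook (the interval is an assumption):

python
import asyncio

@app.on_event("startup")
async def start_counter_cleanup():
    async def purge_stale_records():
        while True:
            today = datetime.now().strftime("%Y-%m-%d")
            # Drop counters left over from previous days
            for ip in [ip for ip, rec in rate_limit_store.items()
                       if rec["date"] != today]:
                del rate_limit_store[ip]
            await asyncio.sleep(3600)  # re-check hourly
    asyncio.create_task(purge_stale_records())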

Overall Technical Architecture

System Design

mermaid
graph TD
    A[User Request] --> B(FastAPI)
    B --> C{Request Type}
    C -->|Text Generation| D[DeepSeek API]
    C -->|Voice Synthesis| E[Kokoro Engine]
    C -->|Image Generation| F[SiliconFlow API]
    D --> G["Data Transformation, Output Assembly & State Storage"]
    E --> G
    F --> G
    G --> H[Response Output]

Some thoughts on the project's future updates:

  • Use sessions to count per-browser users accurately
  • Create a dedicated scheduled task that clears counter records daily
  • Move user data into database storage
  • Let users supply their own DeepSeek and FLUX image-generation API endpoints, so usage time depends on their own quota rather than this host's API limits

📊 Efficiency Gains over the Previous Workflow

Per the project's vision, if every feature runs stably, each stage improves considerably over the traditional simple video blog workflow:

| Stage | Traditional Time | OpenVemo Time | Speedup |
|---|---|---|---|
| Story creation | 1-2 hours | ~35 s | 20x |
| Video voiceover | 1-2 hours | ~55 s | 60x |
| Scene illustration | 3 hours | 1 min per image | 200x |

Finally, here is the demo link.

PS: Since both the DeepSeek and SiliconFlow FLUX services are rate-limited and not yet stable, the online demo is intended for preliminary testing only.

Try openvemo online